Repairing Data through Regular Expressions
Abstract
Since regular expressions are often used to detect errors in sequences such as strings or dates, it is natural to use them for data repair. Motivated by this, we propose a data repair method based on regular expressions that makes the input sequence data obey a given regular expression with minimal revision cost. The proposed method contains two steps: sequence repair and token value repair. For sequence repair, we propose the Regular-expression-based Structural Repair (RSR) algorithm. RSR is a dynamic programming algorithm that uses a Nondeterministic Finite Automaton (NFA) to calculate the edit distance between a prefix of the input string and a partial pattern of the regular expression, with time complexity O(nm^2) and space complexity O(nm), where m is the number of edges in the NFA and n is the length of the input string. We also develop an optimization strategy to achieve higher performance on long strings. For token value repair, we combine the edit-distance-based method and association rules through a unified argument for selecting the proper method. Experimental results on both real and synthetic data show that the proposed method repairs the data effectively and efficiently.
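To make the sequence-repair idea concrete, here is a minimal sketch of a dynamic program that computes the edit distance between a string and the language of an NFA. It is not the authors' RSR algorithm: it assumes unit costs for insertion, deletion and substitution, an NFA without epsilon transitions that is built by hand, and it returns only the cost, not the repaired string. The names `regex_edit_distance` and `EDGES` are hypothetical.

```python
INF = float("inf")

# Hand-built NFA for the pattern ab*c (an assumption for illustration).
# Each edge is (source state, symbol, target state); state 0 is the start,
# state 2 is the only accepting state.
EDGES = [(0, "a", 1), (1, "b", 1), (1, "c", 2)]


def regex_edit_distance(s, edges, start, accepts, num_states):
    """Minimum number of unit-cost edits turning s into a string the NFA accepts."""

    def close_insertions(row):
        # Following an edge without consuming input means inserting its symbol
        # (cost 1). Relax until no state can be reached more cheaply.
        changed = True
        while changed:
            changed = False
            for q, _symbol, r in edges:
                if row[q] + 1 < row[r]:
                    row[r] = row[q] + 1
                    changed = True
        return row

    # row[q] = cheapest way to have consumed the current prefix and be in state q.
    row = [INF] * num_states
    row[start] = 0
    row = close_insertions(row)

    for ch in s:
        # Deleting ch keeps the automaton in the same state at cost 1.
        nxt = [cost + 1 for cost in row]
        # Following an edge while consuming ch is a match (0) or substitution (1).
        for q, symbol, r in edges:
            cost = row[q] + (0 if symbol == ch else 1)
            if cost < nxt[r]:
                nxt[r] = cost
        row = close_insertions(nxt)

    return min(row[q] for q in accepts)


if __name__ == "__main__":
    for text in ["abbc", "abx", "bc", ""]:
        print(repr(text), regex_edit_distance(text, EDGES, 0, {2}, 3))
```

Only one row of the table is kept at a time because the sketch reports the cost alone; recovering the repaired string itself would require keeping back-pointers for every prefix and state.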
Similar resources
Repairing Regular Expressions by Adding Missing Words
Regular expressions are used in many information extraction systems like YAGO, DBpedia, Gate and SystemT. However, they sometimes do not match what their creator wanted to find. We investigate how missing words can be added automatically to a regular expression by creating disjunctions at the appropriate positions. Our demo visualizes the steps that our algorithm employs to repair the regular e...
Semantics, analysis and security of backtracking regular expression matchers
Regular expressions are ubiquitous in computer science. Originally defined by Kleene in 1956, they have become a staple of the computer science undergraduate curriculum. Practical applications of regular expressions are numerous, ranging from compiler construction through smart text editors to network intrusion detection systems. Despite having been vigorously studied and formalized in many way...
Time Extraction from Real-time Generated Football Reports
This paper describes a system to extract events and time information from football match reports generated through minute-by-minute reporting. We describe a method that uses regular expressions to find the events and divides them into different types to determine in which order they occurred. In addition, our system detects time expressions and we present a way to structure the collected data us...
Compile-Time Path Expansion in Lore
Semistructured data usually is modeled as labeled directed graphs, and query languages are based on declarative path expressions that specify traversals through the graphs. Regular (or generalized) path expressions use regular expression operators to specify traversal patterns. Regular path expressions typically are evaluated at run-time by exploring the database graph. However, if the database...
New Rewritings and Optimizations for Regular Path Queries
All the languages for querying semistructured data and the web use as an integral part regular expressions. Based on practical observations, finding the paths that satisfy those regular expressions is very expensive. In this paper, we introduce the “maximal partial rewritings” (MPR’s) for regular path queries using views. The MPR’s are always exact and more useful for the optimization of the re...
Journal:
- PVLDB
Volume 9
Publication date 2016